
Extract content links from markdown nodes#98

Merged
mindsocket merged 2 commits into main from issue-74-content-links
Apr 21, 2026

Conversation

@mindsocket
Owner

Closes #74

Summary

  • Adds a ContentLink type representing links extracted from node body and frontmatter (wikilinks, markdown links, images, bare URLs)
  • New extract-content-links.ts module handles parsing via remark AST for body content and regex scanning for frontmatter string values
  • Graph edge fields (hierarchy + relationship fields) are excluded from frontmatter extraction to avoid duplicating structural edges
  • Links are resolved against the space to classify as space-internal, space-external, or external URLs, and exposed on SpaceNode as contentLinks
  • Validation rules can now target contentLinks fields; schema and validate modules updated accordingly

Test plan

  • bun run test — unit tests pass including new tests/read/content-links.test.ts
  • bun run typecheck — no type errors
  • Verify contentLinks is populated on nodes when dumping a space with links in content

Adds a ContentLink type and extracts wikilinks, markdown links, images,
and bare URLs from node body and frontmatter fields. Links are resolved
to indicate whether they are space-internal, space-external, or external
URLs, and exposed on SpaceNode as contentLinks. Graph edge fields are
excluded from frontmatter link extraction to avoid duplication.
@claude

claude Bot commented Apr 21, 2026

Claude finished @mindsocket's task in 2m 27s


Code Review

Tasks

  • Gather context and understand the implementation
  • Review core extraction logic
  • Review type definitions and API changes
  • Review test coverage
  • Run test suite (✅ 430 tests passed)
  • Run type checking (✅ no errors)
  • Provide comprehensive feedback

Overall Assessment

This is a well-designed and carefully implemented feature that adds content link extraction capabilities to the structured-context system. The implementation follows existing patterns, has comprehensive test coverage, and properly handles edge cases like excluding graph edge fields from extraction.

Strengths:

  • Clean separation of concerns with a dedicated extract-content-links.ts module
  • Comprehensive test coverage (347 lines of new tests in content-links.test.ts)
  • Proper type definitions with clear documentation
  • Smart exclusion of graph edge fields to avoid duplicating structural edges
  • Integration with both frontmatter and body content parsing
  • Proper link classification (node, internal, external, system, protocol)

Detailed Review by Component

1. Core Extraction Logic (extract-content-links.ts)

✅ Well-implemented:

  • parseWikilinkInner correctly handles aliases and anchors (lines 12-32)
  • extractLinksFromAstNode properly walks the mdast tree (lines 108-152)
  • getEdgeFieldNames correctly identifies hierarchy and relationship fields (lines 167-177)
  • extractLinksFromFrontmatter properly excludes edge fields (lines 183-198)
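As a rough illustration of the alias/anchor handling described above, a wikilink inner-text parser might look like the following sketch. The function name matches the review; the body is an assumption for illustration, not the project's actual code.

```typescript
// Hypothetical sketch — splits "[[target#anchor|alias]]" inner text into parts.
type ParsedWikilink = { target: string; anchor?: string; text: string };

function parseWikilinkInner(inner: string): ParsedWikilink {
  // Everything after the first '|' is the display alias.
  const pipe = inner.indexOf('|');
  const alias = pipe >= 0 ? inner.slice(pipe + 1).trim() : undefined;
  const path = (pipe >= 0 ? inner.slice(0, pipe) : inner).trim();

  // Everything after the first '#' is the anchor (heading or ^block ref).
  const hash = path.indexOf('#');
  const target = hash >= 0 ? path.slice(0, hash) : path;
  const anchor = hash >= 0 ? path.slice(hash + 1) : undefined;

  return { target, ...(anchor ? { anchor } : {}), text: alias ?? path };
}

parseWikilinkInner('notes/foo#intro|Foo');
// → { target: 'notes/foo', anchor: 'intro', text: 'Foo' }
```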

⚠️ Minor concerns:

  1. Regex portability issue (line 66):

    // Negative lookbehind: skip URLs immediately preceded by `](` (already a markdown link target)
    for (const match of text.matchAll(/(?<!\]\()https?:\/\/[^\s\])"<>]+/g)) {

    The negative lookbehind (?<!\]\() is not supported in all JavaScript environments (notably Safari before 16.4). While this likely works in the project's target environment, consider an alternative approach if broader compatibility is needed:

    Suggestion:

    // Alternative: post-process to filter out URLs already in markdown links
    const allUrls = text.matchAll(/https?:\/\/[^\s\])"<>]+/g);
    const markdownLinkUrls = text.matchAll(/\[([^\]]*)\]\((https?:\/\/[^)]+)\)/g);
    const markdownUrlSet = new Set(Array.from(markdownLinkUrls, m => m[2]));
    for (const match of allUrls) {
      if (!markdownUrlSet.has(match[0])) {
        links.push({ text: match[0], target: match[0], action: 'link', linkSyntax: 'markdown' });
      }
    }
  2. URL pattern completeness (line 66):
    The regex /https?:\/\/[^\s\])"<>]+/ excludes ), ], ", <, > from URLs. This is reasonable for markdown contexts but might miss valid URLs containing these characters (e.g., Wikipedia URLs with parentheses). This is acceptable for the current use case but worth noting.
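To make that truncation concrete, here is a small standalone demonstration of how the character class clips a URL containing parentheses. The pattern is copied from the review; the sample text is invented.

```typescript
// The bare-URL pattern quoted above; ')' is in the excluded character class.
const bareUrlPattern = /https?:\/\/[^\s\])"<>]+/g;

const sample = 'See https://en.wikipedia.org/wiki/Bracket_(mathematics) for details.';
const [hit] = sample.match(bareUrlPattern) ?? [];

console.log(hit);
// → 'https://en.wikipedia.org/wiki/Bracket_(mathematics' — the closing ')' is clipped
```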

2. Type Definitions (types.ts)

✅ Excellent:

  • ContentLink type is well-documented with clear field descriptions (lines 74-85)
  • ResolvedContentLink properly extends ContentLink with location classification (lines 91-109)
  • BaseNode and SpaceNode types are properly extended with contentLinks and resolvedLinks (lines 115-151)

📝 Documentation suggestion:
The ResolvedContentLink.location field documents 5 possible values but doesn't explicitly mention that 'node' is for wikilinks resolved to known space nodes. The comment is there, but could be more prominent (lines 101-107).

3. Graph Resolution (resolve-graph-edges.ts)

✅ Well-implemented:

  • classifyContentLink correctly handles both wikilink and markdown link classification (lines 156-177)
  • Proper integration with existing resolveGraphEdges function (lines 276-279)
  • Wikilink resolution correctly handles anchors (lines 162-164)

✅ Smart design:
The decision to check both target#anchor and bare target in the index (line 163) is clever - it allows block references to resolve even if the node doesn't explicitly list all block anchors in linkTargets.
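The dual lookup could be sketched as follows; the function and index shapes here are assumptions for illustration, not the project's actual API.

```typescript
// Try the exact "target#anchor" key first, then fall back to the bare target,
// so block references resolve even when anchors aren't listed individually.
function resolvesToNode(index: Set<string>, target: string, anchor?: string): boolean {
  if (anchor && index.has(`${target}#${anchor}`)) return true;
  return index.has(target);
}

const index = new Set(['notes/foo', 'notes/bar#intro']);
resolvesToNode(index, 'notes/foo', '^someblock'); // true — bare target is a known node
resolvesToNode(index, 'notes/bar', 'intro');      // true — exact anchored key exists
resolvesToNode(index, 'notes/baz');               // false
```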

4. Validation Integration (validate.ts)

✅ Proper implementation:

  • Unresolved content links are collected as warnings, not errors (lines 239-251)
  • Deduplication prevents duplicate warnings (line 244)
  • Integration with validateFile properly separates errors from warnings (lines 378-392)

✅ User-friendly output:
The validation output in validate.ts (lines 104 and 113-124) clearly shows unresolved content links as warnings rather than errors.

5. Test Coverage (content-links.test.ts)

✅ Comprehensive:

  • Unit tests for all extraction functions (lines 31-142)
  • Integration tests for end-to-end link classification (lines 148-235)
  • Edge cases covered: anchors, aliases, embeds, protocols, relative paths (lines 170-219)
  • Proper verification that edge fields are excluded (lines 221-225)

✅ Well-structured:
Tests are organized logically by function and include clear assertions.

6. API Changes (api.ts)

✅ Appropriate exports:

  • ContentLink and ResolvedContentLink are exported for public use (lines 29, 36)
  • Consistent with existing API patterns

7. CLI Output Changes

✅ Good UX:

  • Validation now shows warnings separately from errors (validate-file.ts, lines 29-41)
  • Warnings are displayed in yellow to distinguish from errors (line 12)
  • Unresolved content links are properly categorized as warnings (validate.ts, line 104)

Integration Points

✅ Markdown plugin integration:

  • read-space.ts properly extracts links during parsing (line 142)
  • parse-embedded.ts extracts links from embedded nodes (lines 835-836, 849-850)

✅ Schema integration:

  • schemas.ts change to use resolveRef for descriptions (lines 38-39) is an improvement but seems unrelated to this feature. Consider if this should be a separate commit.

Potential Issues & Suggestions

  1. Rule evaluation access:
    The resolvedLinks field is not currently exposed to rule evaluation in evaluate-rule.ts's flattenNode function. This means rules cannot directly query content links. If this is intentional, consider documenting it. If not, this could be a future enhancement.

  2. Performance consideration:
    For large spaces with many nodes and links, the link extraction and resolution could impact performance. Consider if any caching or optimization would be beneficial for very large spaces (1000+ nodes).

  3. Documentation updates:
    While docs/concepts.md was updated, consider adding examples of how to use contentLinks in validation rules once rule evaluation access is added.
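If rule access were added later, exposing resolvedLinks on the flattened node might look something like the sketch below. flattenNode's real shape is not shown in this PR, so the types here are assumptions.

```typescript
type ResolvedLink = { target: string; location: string };
type RawNode = { frontmatter: Record<string, unknown>; resolvedLinks?: ResolvedLink[] };

// Merge frontmatter fields with resolvedLinks so rule expressions can
// reference both uniformly on the flattened node.
function flattenNode(node: RawNode) {
  return { ...node.frontmatter, resolvedLinks: node.resolvedLinks ?? [] };
}

const flat = flattenNode({
  frontmatter: { title: 'Example' },
  resolvedLinks: [{ target: 'https://example.com', location: 'external' }],
});

// A JSONata rule could then query, e.g.: $count(resolvedLinks[location='external']) > 0
const hasExternal = flat.resolvedLinks.some((l) => l.location === 'external');
```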


Verification Results

Test Suite: ✅ All 430 tests passed
Type Checking: ✅ No type errors
Code Quality: ✅ Follows existing patterns and conventions


Summary

This is a high-quality implementation that is ready to merge. The code is well-tested, properly typed, and thoughtfully designed. The minor concerns about regex portability are not blocking issues given the project's likely target environment.

Recommendation: Approve with minor suggestions

The implementation successfully achieves its goals:

  • ✅ Extracts links from both body and frontmatter
  • ✅ Properly classifies links by location
  • ✅ Excludes graph edge fields to avoid duplication
  • ✅ Integrates cleanly with existing validation system
  • ✅ Provides comprehensive test coverage


@mindsocket mindsocket left a comment


There are code simplification and consistency issues to address before this can be merged.

for (const [key, value] of Object.entries(data)) {
  if (edgeFields.has(key)) continue;
  if (typeof value === 'string') {
    links.push(...extractLinksFromText(value));


Couldn't this use mdast to parse the value as markdown and walk the content for link types? That keeps link parsing consistent across frontmatter and content.
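For illustration, walking an mdast tree for link-bearing node types might look like the following sketch. The node shapes follow mdast's link/image conventions, and the hand-built tree stands in for remark's parse output.

```typescript
type MdNode = { type: string; url?: string; children?: MdNode[] };
type FoundLink = { target: string; action: 'link' | 'embed' };

// Recursively collect link and image nodes from an mdast-style tree.
function collectLinks(node: MdNode, out: FoundLink[] = []): FoundLink[] {
  if (node.type === 'link' && node.url) out.push({ target: node.url, action: 'link' });
  if (node.type === 'image' && node.url) out.push({ target: node.url, action: 'embed' });
  for (const child of node.children ?? []) collectLinks(child, out);
  return out;
}

const tree: MdNode = {
  type: 'root',
  children: [{
    type: 'paragraph',
    children: [
      { type: 'link', url: 'https://example.com' },
      { type: 'image', url: './diagram.png' },
    ],
  }],
};

collectLinks(tree);
// → [{ target: 'https://example.com', action: 'link' }, { target: './diagram.png', action: 'embed' }]
```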

Comment thread: src/types.ts (Outdated)
 * A fully classified link from node content, after location resolution.
 * Present on SpaceNode after graph resolution.
 */
export type ResolvedContentLink = {


These types should be composed from a base type since they're largely the same.

- Extract ContentLinkBase to compose ContentLink and ResolvedContentLink
  from shared fields rather than duplicating them
- Use mdast (extractLinksFromBody) for frontmatter string values instead
  of separate regex helpers, keeping link extraction consistent across
  body and frontmatter
- Expose resolvedLinks on flattened nodes so JSONata rules can query
  content links
- Remove extractLinksFromText and its regex helper functions
@claude

claude Bot commented Apr 21, 2026

Claude finished @mindsocket's task in 2m 22s


Code Review

Tasks

  • Gather context and understand the implementation
  • Review frontmatter link extraction approach
  • Review type composition
  • Verify test suite passes
  • Provide review findings

Review Summary

The PR author's review comments have already been fully addressed in commit 4303ceb ("Address PR #98 review: type composition, consistent parsing, rule access"). Both concerns raised have been resolved:

✅ Issue 1: Frontmatter Link Extraction

Review comment: "Couldn't this use mdast to parse the value as markdown and walk the content for link types? That keeps link parsing consistent across frontmatter and content."

Status: Fixed

The initial implementation used regex-based parsing in extractLinksFromFrontmatter() (commit cca0c25). The current version now uses mdast parsing for consistency:

Current implementation (extract-content-links.ts:139-154):

export function extractLinksFromFrontmatter(data: Record<string, unknown>, edgeFields: Set<string>): ContentLink[] {
  const links: ContentLink[] = [];
  for (const [key, value] of Object.entries(data)) {
    if (edgeFields.has(key)) continue;
    if (typeof value === 'string') {
      links.push(...extractLinksFromBody(value));  // ✅ Uses mdast parsing
    } else if (Array.isArray(value)) {
      for (const item of value) {
        if (typeof item === 'string') {
          links.push(...extractLinksFromBody(item));  // ✅ Uses mdast parsing
        }
      }
    }
  }
  return links;
}

The extractLinksFromBody() function (lines 113-116) uses the unified processor with remark:

export function extractLinksFromBody(body: string): ContentLink[] {
  const tree = unified().use(remarkParse).use(remarkGfm).parse(body) as Root;
  return extractLinksFromAstNode(tree);
}

This ensures frontmatter and body content are parsed consistently using the same mdast-based approach.


✅ Issue 2: Type Composition

Review comment: "These types should be composed from a base type since they're largely the same."

Status: Fixed

The initial implementation duplicated all shared fields between ContentLink and ResolvedContentLink. The current version uses proper type composition:

Current implementation (types.ts:70-107):

/**
 * Shared fields for all content link variants.
 */
type ContentLinkBase = {
  /** Display text of the link (text content for markdown links, inner text or target for wikilinks). */
  text: string;
  /** Raw link target: URL for markdown links, wikilink path (without [[ ]]) for wikilinks. */
  target: string;
  /** Whether the link navigates to a resource or transcludes/embeds it. */
  action: 'link' | 'embed';
  /** Anchor fragment (heading or block ref, e.g. 'heading-title' or '^blockid') if present. */
  anchor?: string;
};

/**
 * A link extracted from node content at parse time, before location classification.
 * Populated by plugins that support content link extraction (e.g. the markdown plugin).
 */
export type ContentLink = ContentLinkBase & {
  /** Link syntax — used during resolution to determine how to classify the target. */
  linkSyntax: 'wikilink' | 'markdown';
};

/**
 * A fully classified link from node content, after location resolution.
 * Present on SpaceNode after graph resolution.
 */
export type ResolvedContentLink = ContentLinkBase & {
  /**
   * Resolved location classification:
   * - 'node'     — wikilink resolved to a known space node
   * - 'internal' — wikilink or relative path, in the source system but not a space node
   * - 'external' — http/https URL
   * - 'system'   — file:// URL
   * - 'protocol' — other scheme (e.g. obsidian://)
   */
  location: 'node' | 'internal' | 'external' | 'system' | 'protocol';
};

This is a clean, maintainable approach that:

  • Eliminates duplication of shared fields (text, target, action, anchor)
  • Makes the relationship between the types explicit
  • Follows TypeScript best practices for type composition
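A location classifier consistent with the five values documented above might be sketched like this; the name echoes classifyContentLink from the first review, but the body is an assumption, not the project's actual code.

```typescript
type Location = 'node' | 'internal' | 'external' | 'system' | 'protocol';

function classifyLocation(
  target: string,
  linkSyntax: 'wikilink' | 'markdown',
  knownNodes: Set<string>,
): Location {
  if (linkSyntax === 'wikilink') {
    // Wikilinks either resolve to a known space node or stay source-internal.
    return knownNodes.has(target) ? 'node' : 'internal';
  }
  if (/^https?:\/\//.test(target)) return 'external';
  if (target.startsWith('file://')) return 'system';
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(target)) return 'protocol'; // e.g. obsidian://
  return 'internal'; // relative path within the source system
}
```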

Verification

Test Suite: ✅ All 423 tests passed
Type Checking: ✅ No errors (previous review confirmed)
Code Quality: ✅ Follows existing patterns and conventions


Conclusion

Both review concerns have been successfully addressed. The implementation now:

  • ✅ Uses consistent mdast-based parsing for both frontmatter and body content
  • ✅ Uses proper type composition with ContentLinkBase to eliminate duplication
  • ✅ Maintains comprehensive test coverage
  • ✅ Follows TypeScript and codebase best practices

Recommendation: Ready to merge

The PR successfully implements content link extraction with clean, maintainable code that addresses all review feedback.

@mindsocket mindsocket merged commit c2d4d5f into main Apr 21, 2026
2 checks passed
@mindsocket mindsocket deleted the issue-74-content-links branch April 21, 2026 07:46


Development

Successfully merging this pull request may close these issues.

Parse regular links in content as non-graph nodes
